Demo

Abstract

The purpose of this report is to use scraped Glassdoor salary data to predict salary from the various parameters provided.

This can be used to gain insight into how and why salary varies with location, job role, and other factors. It can also be used for a marketing advantage, by targeting advertisements at people who are likely to look for jobs in a given domain, and by showing job seekers which skill sets and experience are required. Salary prediction is a regression problem: the scraped data is used to build a model that predicts the range of salary one can expect. This project dives into salary prediction through machine learning. "End to end" means a step-by-step process: it starts with data collection and EDA, continues with data preparation (cleaning and transforming), then moves on to selecting, training, and saving ML models, cross-validation and hyper-parameter tuning, and finally develops a web service and deploys it so end users can use it anytime and anywhere.

This repository contains the code for salary prediction using various Python libraries: numpy, pandas, matplotlib, seaborn, sklearn, time, joblib, selenium, and pickle. Each library provides one particular piece of functionality. Pandas objects rely heavily on NumPy objects; NumPy ("Numerical Python") is used for working with arrays. Matplotlib is a plotting library, and seaborn is a data-visualization library built on top of matplotlib. Sklearn provides a large collection of models. "Pickling" is the process whereby a Python object hierarchy is converted into a byte stream. The time module provides many ways of representing time in code, such as objects, numbers, and strings. Joblib is a set of tools providing lightweight pipelining in Python. The purpose of creating this repository is to gain insight into a complete ML project; building it deepened my practical knowledge of these libraries and grew my ML repository. The screenshots above and the video in the Video_File folder will help you understand the flow of the output.
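As a small illustration of the pickling step (using a hypothetical dictionary rather than the project's actual model):

```python
import pickle

# "Pickling" converts a Python object hierarchy into a byte stream;
# "unpickling" reverses it. The record below is a made-up example.
record = {"job_title": "data scientist", "avg_salary": 120.5}

data = pickle.dumps(record)    # object -> bytes
restored = pickle.loads(data)  # bytes -> object
```

The project saves its fitted pipeline in the same spirit via joblib, which offers a similar dump/load interface optimized for objects containing large numpy arrays.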

Motivation

The reason behind building this is that salary is one of the most important factors of a job and of life, so every one of us looks at the salary offered by a company before applying or joining. It is also one of the strong factors behind performance: when it comes to salary and pay-for-performance schemes, companies need to consider carefully what it is they want to reward and to what extent there might be different ways to motivate and engage employees. Another reason is that until now I had worked on individual concepts, so I wanted to combine everything I had learnt and create an end-to-end project that shows the whole life cycle of an ML project. In addition, being able to carry out an entire process on my own is an essential skill for an employee of a company. Building an end-to-end project gave me a wholesome approach to handling the given data. Hence, I continue to gain knowledge while practicing, and to spread my literary wings in tech heaven.

The Data

It displays name of the columns.

It displays number of unique categories in a particular column.

It shows 742 total observations and 28 columns.

It displays missing values if any.

It displays number of rows and columns.

Dropped the columns that were not required; the final number of rows and columns used is then displayed.

Analysis of Data

Let’s start by doing a general analysis of the data as a whole, including all the features the Linear Regression algorithm will be using.

Basic Statistics

Graphing of Features

Graph Set 1

Graph Set 2

Graph Set 3

Graph Set 4

Graph Set 5

Graph Set 6

Modelling

The Math behind the metrics

Linear Regression is a predictive algorithm that models a linear relationship between the prediction (call it 'Y') and the input (call it 'X').

As we know from basic maths, if we plot an 'X','Y' graph, a linear relationship always comes out as a straight line.
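A minimal sketch of that straight-line relationship, fitting made-up values with an ordinary least-squares fit (numpy's `polyfit` plays the role of the regression here):

```python
import numpy as np

# Toy values that lie exactly on the line y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + 1

# A degree-1 least-squares fit recovers the slope and intercept of the line.
slope, intercept = np.polyfit(x, y, 1)
```

Because the points are perfectly linear, the recovered slope and intercept match the generating line; real data only approximates this.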

Model Architecture Process Through Visualization

Linear Regression Architecture:

Quick Notes

Step 1: Scraped Glassdoor data (glassdoor_scraper.py) and collected it in the glassdoor_jobs.csv file.

- Initialized web driver.

- Loaded the page.

- Tested for the “Sign Up” prompt and got rid of it.

- Went through the list of jobs.

- Set a “not found” value for missing fields.

- Converted dictionary objects into pandas dataframe.

- Collected data in csv file.

Step 2: Performed data cleaning (data_cleaning.py) on glassdoor_jobs.csv and created the salary_data_cleaned.csv file.

- Read stored csv file to clean the data.

- Performed salary parsing.

- Performed operations like ‘apply’ on columns such as “company name”, “state”, and “age”, and stored the data in a csv file.

Step 3: Performed EDA (data_eda.ipynb) on salary_data_cleaned.csv and created eda_data.csv file.

- Read the cleaned data csv file.

- Analyzed the data through “.columns, .value_counts” and other operations.

- Visualized data through histograms, box plots and heat maps.

Step 4: Built model (model_building.ipynb) on eda_data.csv and created simple_data.csv file and model_pipeline.pkl.

- Read the eda csv file.

- Dropped the columns that were not required.

- Created and stored data in new csv file.

- Created dummy variables using “.get_dummies()” function.

- Created a pipeline and fitted a linear model on the training data.

- Saved the model as pickle file to re-use it.

Step 5: Generated web service in streamlit (salary_prediction.py) on simple_data.csv file and model_pipeline.pkl file.

- Created a data frame from the required fields and then predicted the salary value.

The Model Analysis

from selenium.common.exceptions import NoSuchElementException, ElementClickInterceptedException
from selenium import webdriver
import time
import pandas as pd


def get_jobs(keyword, num_jobs, verbose, path, slp_time):
    
    '''Gathers jobs as a dataframe, scraped from Glassdoor'''
    
    #Initializing the webdriver
    options = webdriver.ChromeOptions()
    
    #Uncomment the line below if you'd like to scrape without a new Chrome window every time.
    #options.add_argument('headless')
    
    #Change the path to where chromedriver is in your home folder.
    driver = webdriver.Chrome(executable_path=path, options=options)
    driver.set_window_size(1120, 1000)
    
    url = "https://www.glassdoor.com/Job/jobs.htm?suggestCount=0&suggestChosen=false&clickSource=searchBtn&typedKeyword="+keyword+"&sc.keyword="+keyword+"&locT=&locId=&jobType="
    #url = 'https://www.glassdoor.com/Job/jobs.htm?sc.keyword="' + keyword + '"&locT=C&locId=1147401&locKeyword=San%20Francisco,%20CA&jobType=all&fromAge=-1&minSalary=0&includeNoSalaryJobs=true&radius=100&cityId=-1&minRating=0.0&industryId=-1&sgocId=-1&seniorityType=all&companyId=-1&employerSizes=0&applicationType=0&remoteWorkType=0'
    driver.get(url)
    jobs = []

    while len(jobs) < num_jobs:  #If true, should be still looking for new jobs.

        #Let the page load. Change this number based on your internet speed.
        #Or, wait until the webpage is loaded, instead of hardcoding it.
        time.sleep(slp_time)

        #Test for the "Sign Up" prompt and get rid of it.
        try:
            driver.find_element_by_class_name("selected").click()
        except ElementClickInterceptedException:
            pass

        time.sleep(.1)

        try:
            driver.find_element_by_css_selector('[alt="Close"]').click() #clicking to the X.
            print(' x out worked')
        except NoSuchElementException:
            print(' x out failed')
            pass

        
        #Going through each job in this page
        job_buttons = driver.find_elements_by_class_name("jl")  #jl for Job Listing. These are the buttons we're going to click.
        for job_button in job_buttons:  

            print("Progress: {}".format("" + str(len(jobs)) + "/" + str(num_jobs)))
            if len(jobs) >= num_jobs:
                break

            job_button.click()
            time.sleep(1)
            collected_successfully = False
            
            while not collected_successfully:
                try:
                    company_name = driver.find_element_by_xpath('.//div[@class="employerName"]').text
                    location = driver.find_element_by_xpath('.//div[@class="location"]').text
                    job_title = driver.find_element_by_xpath('.//div[contains(@class, "title")]').text
                    job_description = driver.find_element_by_xpath('.//div[@class="jobDescriptionContent desc"]').text
                    collected_successfully = True
                except:
                    time.sleep(5)

            try:
                salary_estimate = driver.find_element_by_xpath('.//span[@class="gray salary"]').text
            except NoSuchElementException:
                salary_estimate = -1 #You need to set a "not found" value. It's important.
            
            try:
                rating = driver.find_element_by_xpath('.//span[@class="rating"]').text
            except NoSuchElementException:
                rating = -1 #You need to set a "not found" value. It's important.

            #Printing for debugging
            if verbose:
                print("Job Title: {}".format(job_title))
                print("Salary Estimate: {}".format(salary_estimate))
                print("Job Description: {}".format(job_description[:500]))
                print("Rating: {}".format(rating))
                print("Company Name: {}".format(company_name))
                print("Location: {}".format(location))

            #Going to the Company tab...
            #clicking on this:
            #<div class="tab" data-tab-type="overview"><span>Company</span></div>
            try:
                driver.find_element_by_xpath('.//div[@class="tab" and @data-tab-type="overview"]').click()

                try:
                    #<div class="infoEntity"><label>Headquarters</label><span class="value">San Francisco, CA</span></div>
                    headquarters = driver.find_element_by_xpath('.//div[@class="infoEntity"]//label[text()="Headquarters"]//following-sibling::*').text
                except NoSuchElementException:
                    headquarters = -1

                try:
                    size = driver.find_element_by_xpath('.//div[@class="infoEntity"]//label[text()="Size"]//following-sibling::*').text
                except NoSuchElementException:
                    size = -1

                try:
                    founded = driver.find_element_by_xpath('.//div[@class="infoEntity"]//label[text()="Founded"]//following-sibling::*').text
                except NoSuchElementException:
                    founded = -1

                try:
                    type_of_ownership = driver.find_element_by_xpath('.//div[@class="infoEntity"]//label[text()="Type"]//following-sibling::*').text
                except NoSuchElementException:
                    type_of_ownership = -1

                try:
                    industry = driver.find_element_by_xpath('.//div[@class="infoEntity"]//label[text()="Industry"]//following-sibling::*').text
                except NoSuchElementException:
                    industry = -1

                try:
                    sector = driver.find_element_by_xpath('.//div[@class="infoEntity"]//label[text()="Sector"]//following-sibling::*').text
                except NoSuchElementException:
                    sector = -1

                try:
                    revenue = driver.find_element_by_xpath('.//div[@class="infoEntity"]//label[text()="Revenue"]//following-sibling::*').text
                except NoSuchElementException:
                    revenue = -1

                try:
                    competitors = driver.find_element_by_xpath('.//div[@class="infoEntity"]//label[text()="Competitors"]//following-sibling::*').text
                except NoSuchElementException:
                    competitors = -1

            except NoSuchElementException:  #Rarely, some job postings do not have the "Company" tab.
                headquarters = -1
                size = -1
                founded = -1
                type_of_ownership = -1
                industry = -1
                sector = -1
                revenue = -1
                competitors = -1

            if verbose:
                print("Headquarters: {}".format(headquarters))
                print("Size: {}".format(size))
                print("Founded: {}".format(founded))
                print("Type of Ownership: {}".format(type_of_ownership))
                print("Industry: {}".format(industry))
                print("Sector: {}".format(sector))
                print("Revenue: {}".format(revenue))
                print("Competitors: {}".format(competitors))
                print("@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@")

            #add job to jobs
            jobs.append({"Job Title" : job_title,
                         "Salary Estimate" : salary_estimate,
                         "Job Description" : job_description,
                         "Rating" : rating,
                         "Company Name" : company_name,
                         "Location" : location,
                         "Headquarters" : headquarters,
                         "Size" : size,
                         "Founded" : founded,
                         "Type of ownership" : type_of_ownership,
                         "Industry" : industry,
                         "Sector" : sector,
                         "Revenue" : revenue,
                         "Competitors" : competitors})

        #Clicking on the "next page" button
        try:
            driver.find_element_by_xpath('.//li[@class="next"]//a').click()
        except NoSuchElementException:
            print("Scraping terminated before reaching target number of jobs. Needed {}, got {}.".format(num_jobs, len(jobs)))
            break

    return pd.DataFrame(jobs)  #This line converts the dictionary objects into a pandas DataFrame.


##Data Collection
import glassdoor_scraper as gs
import pandas as pd

path = "C:/Users/Monica/Desktop/salary_prediction_app/chromedriver.exe"
df = gs.get_jobs('data scientist', 1000, False, path, 15)
df.to_csv('glassdoor_jobs.csv', index = False)

Scraped and collected data in a csv file – First, initialized the web driver. WebDriver is an interface provided by Selenium WebDriver: it declares methods without implementing them, so it cannot be instantiated directly. The WebDriver interface serves as a contract that each browser-specific implementation, such as ChromeDriver or FirefoxDriver, must follow. An interface allows sending a message to an object without caring which class it belongs to; the class must provide the functionality for the methods the interface declares. Interfaces are useful because they define contracts that let objects work together without needing to know anything else about each other; the point of an interface is not to help you remember which methods to implement, but to define that contract. Second, loaded the page. Third, tested for the “Sign Up” prompt and got rid of it. Fourth, went through the list of jobs. Fifth, set a “not found” value, which is an important step. Sixth, converted the dictionary objects into a pandas data frame so the collected data could be stored in a csv file.
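The "interface as contract" idea can be sketched in Python with an abstract base class (a hypothetical minimal driver, not Selenium's real API):

```python
from abc import ABC, abstractmethod

class Driver(ABC):
    """Contract: every concrete driver must implement get()."""
    @abstractmethod
    def get(self, url):
        ...

class FakeChromeDriver(Driver):
    """A browser-specific implementation fulfilling the contract."""
    def get(self, url):
        return "loaded " + url

# Driver itself cannot be instantiated; the concrete class can.
driver = FakeChromeDriver()
```

Code written against `Driver` works with any implementation, which is exactly how Selenium lets the same script drive Chrome or Firefox.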

import pandas as pd 

df = pd.read_csv('glassdoor_jobs.csv')

#salary parsing 

df['hourly'] = df['Salary Estimate'].apply(lambda x: 1 if 'per hour' in x.lower() else 0)
df['employer_provided'] = df['Salary Estimate'].apply(lambda x: 1 if 'employer provided salary:' in x.lower() else 0)

df = df[df['Salary Estimate'] != '-1']
salary = df['Salary Estimate'].apply(lambda x: x.split('(')[0])
minus_Kd = salary.apply(lambda x: x.replace('K','').replace('$',''))

min_hr = minus_Kd.apply(lambda x: x.lower().replace('per hour','').replace('employer provided salary:',''))

df['min_salary'] = min_hr.apply(lambda x: int(x.split('-')[0]))
df['max_salary'] = min_hr.apply(lambda x: int(x.split('-')[1]))
df['avg_salary'] = (df.min_salary+df.max_salary)/2

#Company name text only
df['company_txt'] = df.apply(lambda x: x['Company Name'] if x['Rating'] <0 else x['Company Name'][:-3], axis = 1)

#state field 
df['job_state'] = df['Location'].apply(lambda x: x.split(',')[1])
df.job_state.value_counts()

df['same_state'] = df.apply(lambda x: 1 if x.Location == x.Headquarters else 0, axis = 1)

#age of company 
df['age'] = df.Founded.apply(lambda x: x if x <1 else 2020 - x)

#parsing of job description (python, etc.)

#python
df['python_yn'] = df['Job Description'].apply(lambda x: 1 if 'python' in x.lower() else 0)
 
#r studio 
df['R_yn'] = df['Job Description'].apply(lambda x: 1 if 'r studio' in x.lower() or 'r-studio' in x.lower() else 0)
df.R_yn.value_counts()

#spark 
df['spark'] = df['Job Description'].apply(lambda x: 1 if 'spark' in x.lower() else 0)
df.spark.value_counts()

#aws 
df['aws'] = df['Job Description'].apply(lambda x: 1 if 'aws' in x.lower() else 0)
df.aws.value_counts()

#excel
df['excel'] = df['Job Description'].apply(lambda x: 1 if 'excel' in x.lower() else 0)
df.excel.value_counts()

df.columns

df_out = df.drop(['Unnamed: 0'], axis =1)

df_out.to_csv('salary_data_cleaned.csv',index = False)   

Cleaned the collected data – Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled. To perform data cleansing in Python, you can drop the missing values, replace each NaN with a scalar value, or fill forward or backward. First, read the stored csv file to clean the data. Second, performed salary parsing. A parser is a software component that takes input data (frequently text) and builds a data structure – often some kind of parse tree, abstract syntax tree, or other hierarchical structure – giving a structural representation of the input while checking for correct syntax. Third, performed operations like ‘apply’ on columns such as “company name”, “state”, and “age”, and stored the data in a csv file.
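The NaN-handling options mentioned above look like this on a hypothetical mini-frame:

```python
import numpy as np
import pandas as pd

# A made-up column with one missing salary.
df = pd.DataFrame({"salary": [100.0, np.nan, 120.0]})

dropped = df.dropna()   # drop rows containing missing values
scalar  = df.fillna(0)  # replace each NaN with a scalar value
forward = df.ffill()    # fill forward from the previous row
```

The cleaning script in this project takes a different route for salaries: rows whose Salary Estimate is the "not found" sentinel (-1) are filtered out entirely.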

import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 

df = pd.read_csv('salary_data_cleaned.csv')

df.head()
df.columns

def title_simplifier(title):
    if 'data scientist' in title.lower():
        return 'data scientist'
    elif 'data engineer' in title.lower():
        return 'data engineer'
    elif 'analyst' in title.lower():
        return 'analyst'
    elif 'machine learning' in title.lower():
        return 'mle'
    elif 'manager' in title.lower():
        return 'manager'
    elif 'director' in title.lower():
        return 'director'
    else:
        return 'na'
    
def seniority(title):
    if 'sr' in title.lower() or 'senior' in title.lower() or 'lead' in title.lower() or 'principal' in title.lower():
        return 'senior'
    elif 'jr' in title.lower() or 'jr.' in title.lower():
        return 'jr'
    else:
        return 'na'

df['job_simp'] = df['Job Title'].apply(title_simplifier)

df.job_simp.value_counts()

df['seniority'] = df['Job Title'].apply(seniority)
df.seniority.value_counts()

df['job_state']= df.job_state.apply(lambda x: x.strip() if x.strip().lower() != 'los angeles' else 'CA')
df.job_state.value_counts()

df['desc_len'] = df['Job Description'].apply(lambda x: len(x))
df['desc_len']

df['num_comp'] = df['Competitors'].apply(lambda x: len(x.split(',')) if x != '-1' else 0)
df['Competitors']
df['min_salary'] = df.apply(lambda x: x.min_salary*2 if x.hourly ==1 else x.min_salary, axis =1)
df['max_salary'] = df.apply(lambda x: x.max_salary*2 if x.hourly ==1 else x.max_salary, axis =1)
df[df.hourly ==1][['hourly','min_salary','max_salary']]
df['company_txt'] = df.company_txt.apply(lambda x: x.replace('\n', ''))
df['company_txt']
df.describe()
df.columns
df.Rating.hist()
df.avg_salary.hist()
df.age.hist()
df.desc_len.hist()
df.boxplot(column = ['age','avg_salary','Rating'])
df.boxplot(column = 'Rating')
df[['age','avg_salary','Rating','desc_len']].corr()
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(df[['age','avg_salary','Rating','desc_len','num_comp']].corr(),vmax=.3, center=0, cmap=cmap,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

df.columns

df_cat = df[['Location', 'Headquarters', 'Size','Type of ownership', 'Industry', 'Sector', 'Revenue', 'company_txt', 'job_state','same_state', 'python_yn', 'R_yn',
       'spark', 'aws', 'excel', 'job_simp', 'seniority']]

for i in df_cat.columns:
    cat_num = df_cat[i].value_counts()
    print("graph for %s: total = %d" % (i, len(cat_num)))
    chart = sns.barplot(x=cat_num.index, y=cat_num)
    chart.set_xticklabels(chart.get_xticklabels(), rotation=90)
    plt.show()

for i in df_cat[['Location','Headquarters','company_txt']].columns:
    cat_num = df_cat[i].value_counts()[:20]
    print("graph for %s: total = %d" % (i, len(cat_num)))
    chart = sns.barplot(x=cat_num.index, y=cat_num)
    chart.set_xticklabels(chart.get_xticklabels(), rotation=90)
    plt.show()
    
df.columns

pd.pivot_table(df, index = 'job_simp', values = 'avg_salary')

pd.pivot_table(df, index = ['job_simp','seniority'], values = 'avg_salary')

pd.pivot_table(df, index = ['job_state','job_simp'], values = 'avg_salary').sort_values('job_state', ascending = False)

pd.options.display.max_rows
pd.set_option('display.max_rows', None)

pd.pivot_table(df, index = ['job_state','job_simp'], values = 'avg_salary', aggfunc = 'count').sort_values('job_state', ascending = False)

pd.pivot_table(df[df.job_simp == 'data scientist'], index = 'job_state', values = 'avg_salary').sort_values('avg_salary', ascending = False)

df.columns

df_pivots = df[['Rating', 'Industry', 'Sector', 'Revenue', 'num_comp', 'hourly', 'employer_provided', 'python_yn', 'R_yn', 'spark', 'aws', 'excel', 'Type of ownership','avg_salary']]

for i in df_pivots.columns:
    print(i)
    print(pd.pivot_table(df_pivots,index =i, values = 'avg_salary').sort_values('avg_salary', ascending = False))
    
pd.pivot_table(df_pivots, index = 'Revenue', columns = 'python_yn', values = 'avg_salary', aggfunc = 'count')

from wordcloud import WordCloud, ImageColorGenerator, STOPWORDS
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

words = " ".join(df['Job Description'])

def punctuation_stop(text):
    """remove punctuation and stop words"""
    filtered = []
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(text)
    for w in word_tokens:
        if w not in stop_words and w.isalpha():
            filtered.append(w.lower())
    return filtered


words_filtered = punctuation_stop(words)

text = " ".join([ele for ele in words_filtered])

wc= WordCloud(background_color="white", random_state=1,stopwords=STOPWORDS, max_words = 2000, width =800, height = 1500)
wc.generate(text)

plt.figure(figsize=[10,10])
plt.imshow(wc, interpolation="bilinear")
plt.axis('off')
plt.show()          

Performed EDA – The primary goal of EDA is to maximize the analyst's insight into a data set and into its underlying structure, while providing the specific items an analyst would want to extract from it, such as a good-fitting, parsimonious model and a list of outliers. Many Python libraries, such as pandas, NumPy, matplotlib, and seaborn, are available for EDA. The four types of EDA are univariate non-graphical, multivariate non-graphical, univariate graphical, and multivariate graphical. First, read the cleaned data csv file. Second, analysed the data through “.columns”, “.value_counts”, and other operations. Third, plotted the data through histograms, box plots, and heat maps. A correlation heatmap uses colored cells, typically in a monochromatic scale, to show a 2D correlation matrix between two discrete dimensions. Correlation ranges from -1 to +1; values close to zero mean there is no linear trend between the two variables, while the closer the correlation is to 1, the more positively correlated they are – as one increases, so does the other, and the closer to 1, the stronger the relationship. A box plot graphically depicts groups of numerical data through their quartiles: the box extends from Q1 to Q3, with a line at the median (Q2), and the whiskers extend from the edges of the box to show the range of the data. A box-and-whisker plot is a way of summarizing data measured on an interval scale; it is often used in exploratory data analysis to show the shape of the distribution, its central value, and its variability. A histogram is a representation of the distribution of numerical data, where the data are binned and the count for each bin is represented; more generally, in plotly a histogram is an aggregated bar chart with several possible aggregation functions (e.g. sum, average, count).
The purpose of a histogram (Chambers) is to graphically summarize the distribution of a univariate data set. It is used to summarize discrete or continuous data measured on an interval scale.
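The quartiles that a box plot summarizes can be computed directly; a small check on toy data:

```python
import numpy as np

# Nine evenly spaced toy observations.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Q1, the median (Q2), and Q3 are the 25th, 50th, and 75th percentiles.
q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1  # the height of the box; the whiskers extend from its edges
```

For this data the box would run from 3 to 7 with its median line at 5, which is what `df.boxplot` draws for each column in the EDA notebook.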

## Pipe Building
import pandas as pd 
import numpy as np

from sklearn.pipeline import Pipeline,FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin

class VarSelector(BaseEstimator, TransformerMixin):
    
    def __init__(self,var_names,drop_var=False):
        self.vars=var_names
        self.drop_var=drop_var
    
    def fit(self,x,y=None):
        return self
    
    def transform(self,X):
        if self.drop_var:
            return X.drop(self.vars, axis=1)
        else:
            return X[self.vars]



class string_clean(BaseEstimator, TransformerMixin):
    
    def __init__(self,replace_it='',replace_with=''):
        self.replace_it=replace_it
        self.replace_with=replace_with
    
    def fit(self,x,y=None):
        return self
    
    def transform(self,X):
        for col in X.columns:
            X[col]=X[col].str.replace(self.replace_it,self.replace_with)
        return X



class convert_to_numeric(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        pass
    
    def fit(self,x,y=None):
        return self
    
    def transform(self,X):
        for col in X.columns:
            X[col]=pd.to_numeric(X[col],errors='coerce')
        return X


class get_dummies_Pipe(BaseEstimator, TransformerMixin):
    
    def __init__(self,freq_cutoff=0):
        self.freq_cutoff=freq_cutoff
        self.var_cat_dict={}
        
    def fit(self,x,y=None):
        data_cols=x.columns

        for col in data_cols:
            k=x[col].value_counts()
            if (k<=self.freq_cutoff).sum()==0:
                cats=k.index[k>self.freq_cutoff][:-1]

            else:
                cats=k.index[k>self.freq_cutoff]

            self.var_cat_dict[col]=cats
        return self
            
    def transform(self,x,y=None):
        dummy_data=x.copy()
        for col in self.var_cat_dict.keys():
            for cat in self.var_cat_dict[col]:
                name=col+'_'+cat
                dummy_data[name]=(dummy_data[col]==cat).astype(int)
            del dummy_data[col]
        return dummy_data

        


class custom_fico(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        pass
    
    def fit(self,x,y=None):
        return self
    
    def transform(self,X):
        k=X['FICO.Range'].str.split("-",expand=True).astype(float)
        X['fico']=0.5*(k[0]+k[1])
        del X['FICO.Range']
        return X



## Model Building 
import pandas as pd
file=r'/Users/Monica/Desktop/salary_prediction_app/eda_data.csv'
train=pd.read_csv(file)
train=train.iloc[:,1:]
import re
k=[re.sub(r'[$A-Za-z\(\)\.:]','',x) for x in train['Salary Estimate']]
train['Salary Estimate']=pd.Series(k).str.split('-',expand=True).astype(float).mean(axis=1)
train['Company Name']=train['Company Name'].str.split('\n',expand=True).iloc[:,0]
train.columns
drop_cols=['Job Description','seniority','job_simp','min_salary','max_salary','avg_salary','company_txt','desc_len']
train.drop(drop_cols, axis=1, inplace=True)
train.shape
train['Size'].value_counts()
train.to_csv('simple_data.csv',index=False)
from mypipes import *
cat_vars=list(train.select_dtypes(['object']).columns)
num_vars=list(train.select_dtypes(exclude=['object']).columns)
num_vars.remove('Salary Estimate')
cat_vars
cat_pipe=Pipeline([
    ('cat_var_select',VarSelector(cat_vars)),
    ('create_dummies',get_dummies_Pipe(74))
])
data_pipe=FeatureUnion([
    ('num_vars',VarSelector(num_vars)),
    ('cat_data',cat_pipe)
])
from sklearn.linear_model import LinearRegression
complete_pipe=Pipeline([
    ('data_pipe',data_pipe),
    ('model',LinearRegression(fit_intercept=True))
])
target='Salary Estimate'
y=train[target]
x=train.drop(target, axis=1)

complete_pipe.fit(x,y)

import joblib  #sklearn.externals.joblib was removed in newer scikit-learn versions
joblib.dump(complete_pipe,'model_pipeline.pkl')

x[num_vars].describe()

x.columns                

Built model – First, read the eda csv file. Second, dropped the columns that were not required. Third, created and stored the data in a new csv file. Fourth, created dummy variables: ‘pd.get_dummies’ converts categorical string values into dummy variables, producing a new data frame whose columns are the unique values, filled with zeros and ones. Fifth, created a pipeline and fitted a linear model on the training data. A pipeline consists of a sequence of stages; there are two basic types of stage, Transformer and Estimator, and a Transformer takes a dataset as input and produces an augmented dataset as output. A data pipeline views all data as streaming data and allows for flexible schemas; it provides end-to-end velocity by eliminating errors and combatting bottlenecks or latency, and it can process multiple data streams at once. The fit() method takes the training data as arguments, which can be one array in the case of unsupervised learning, or two arrays in the case of supervised learning. Sixth, saved the model as a pickle file to re-use it.
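What `pd.get_dummies` does, shown on a hypothetical toy column:

```python
import pandas as pd

# A made-up categorical column like the ones in this project.
df = pd.DataFrame({"seniority": ["senior", "jr", "na"]})

# Each unique value becomes its own 0/1 indicator column.
dummies = pd.get_dummies(df["seniority"])
```

The custom `get_dummies_Pipe` class above does the same job inside the pipeline, with a frequency cutoff so rare categories don't each get a column.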

import streamlit as st
import pandas as pd

st.title('Predicting Salaries')

default_dd=pd.read_csv('/Users/Monica/Desktop/salary_prediction_app/simple_data.csv')

_job_title=list(default_dd['Job Title'].unique())
_company_name=list(default_dd['Company Name'].unique())
_location=list(default_dd['Location'].unique())
_hq=list(default_dd['Headquarters'].unique())
_size=list(default_dd['Size'].unique())
_ownership=list(default_dd['Type of ownership'].unique())
_industry=list(default_dd['Industry'].unique())
_sector=list(default_dd['Sector'].unique())
_revenue=list(default_dd['Revenue'].unique())
_competitors=list(default_dd['Competitors'].unique())
_job_state=list(default_dd['job_state'].unique())

job_title=st.selectbox('Select Job Title',options=_job_title)
company_name=st.selectbox('Select Company Name',options=_company_name)
location=st.selectbox('Select Location',options=_location)
hq=st.selectbox('Select Headquarters',options=_hq)
size=st.selectbox('Select Size',options=_size)
ownership=st.selectbox('Select Type of ownership',options=_ownership)
industry=st.selectbox('Select Industry',options=_industry)
sector=st.selectbox('Select Sector',options=_sector)
revenue=st.selectbox('Select Revenue',options=_revenue)
competitors=st.selectbox('Select Competitors',options=_competitors)
job_state=st.selectbox('Select job_state',options=_job_state)

rating=st.slider('Select Rating',float(default_dd['Rating'].min()),float(default_dd['Rating'].max()))
founded=st.slider('Select Founded',float(default_dd['Founded'].min()),float(default_dd['Founded'].max()))
hourly=st.selectbox('Select hourly',options=[0,1])
employer_provided=st.selectbox('Select employer_provided',options=[0,1])
same_state=st.selectbox('Select same_state',options=[0,1])
age=st.slider('Select age',float(default_dd['age'].min()),float(default_dd['age'].max()))
python_yn=st.selectbox('Select python_yn',options=[0,1])
R_yn=st.selectbox('Select R_yn',options=[0,1])
spark=st.selectbox('Select spark',options=[0,1])
aws=st.selectbox('Select aws',options=[0,1])
excel=st.selectbox('Select excel',options=[0,1])
num_comp=st.selectbox('Select num_comp',options=list(default_dd['num_comp'].unique()))


x=pd.DataFrame({'Job Title':[job_title], 
	'Rating':[rating], 
	'Company Name':[company_name], 
	'Location':[location], 
	'Headquarters':[hq],
    'Size':[size], 
    'Founded':[founded], 
    'Type of ownership':[ownership], 
    'Industry':[industry], 
    'Sector':[sector], 
    'Revenue':[revenue],
    'Competitors':[competitors],
    'hourly':[hourly], 
    'employer_provided':[employer_provided], 
    'job_state':[job_state], 
    'same_state':[same_state],
    'age':[age], 
    'python_yn':[python_yn], 
    'R_yn':[R_yn], 
    'spark':[spark], 
    'aws':[aws], 
    'excel':[excel], 
    'num_comp':[num_comp]})

import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

model=joblib.load('model_pipeline.pkl')

st.title('Your Predicted Salary is :')
st.info('$'+str(int(model.predict(x)[0]))+'K')         

Generated web app – built the web app in Streamlit for end-users.

Challenges I faced in this project were scraping the data and automating it with Selenium, and, secondly, while building the web app, getting request.get() to work as it should.

Creation of App

Here, I created the Streamlit app: built a data frame from the required fields and then predicted the salary value.

Technical Aspect

Numpy used for working with arrays. It also has functions for working in domain of linear algebra, fourier transform, and matrices. It contains a multi-dimensional array and matrix data structures. It can be utilised to perform a number of mathematical operations on arrays.
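As a minimal, self-contained sketch (not part of the project's code), a few of these array operations:

```python
import numpy as np

# Create a 2-D array and apply vectorized math to every element.
a = np.array([[1.0, 2.0], [3.0, 4.0]])
print(a.mean())          # average of all elements -> 2.5
print((a * 2).sum())     # element-wise multiply, then sum -> 20.0

# Basic linear algebra: matrix product of a with itself.
print(a @ a)
```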

The Pandas module mainly works with tabular data. It contains the DataFrame and Series structures. For raw element-wise numerical operations Pandas can be noticeably slower than Numpy, since its objects carry extra indexing overhead, but it is seriously a game changer when it comes to cleaning, transforming, manipulating and analyzing data.
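A toy illustration of this tabular workflow (the column names mimic the scraped salary data but the values are made up):

```python
import pandas as pd

# Build a small tabular DataFrame, similar in spirit to the scraped data.
df = pd.DataFrame({
    'Job Title': ['Data Scientist', 'Data Engineer', 'Data Scientist'],
    'Rating': [4.2, 3.8, 4.5],
})

# Group and aggregate: mean rating per job title.
mean_rating = df.groupby('Job Title')['Rating'].mean()
print(mean_rating['Data Scientist'])  # -> 4.35
```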

Matplotlib is used for EDA. Visualizing graphs helps to understand data better than numbers in table format. Matplotlib is mainly deployed for basic plotting: bars, pies, lines, scatter plots and so on. The inline backend displays visualizations within frontends like Jupyter Notebook, directly below the code cell that produced them.
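A minimal bar-chart sketch of the kind used in the EDA step (the state labels and salary numbers here are invented for illustration):

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

# A basic bar chart: (hypothetical) average salary per job state.
fig, ax = plt.subplots()
ax.bar(['CA', 'NY', 'TX'], [120, 110, 95])
ax.set_xlabel('job_state')
ax.set_ylabel('avg salary (K)')

out = os.path.join(tempfile.gettempdir(), 'salary_by_state.png')
fig.savefig(out)  # write the figure to disk
```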

Seaborn provides a high-level interface for drawing attractive and informative statistical graphics. It provides a variety of visualization patterns and can visualize random distributions.

Sklearn is short for scikit-learn. It provides many ML algorithms, offering a range of supervised and unsupervised learning algorithms via a consistent interface in Python.

Need for train_test_split – using the same dataset for both training and testing leaves room for miscalculation, thus increasing the chance of inaccurate predictions. The train_test_split function lets you split a dataset with ease while pursuing an ideal model. Also, keep in mind that your model should not be overfitting or underfitting.
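A short sketch of the split itself, on synthetic data rather than the project's dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y = np.arange(10)

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))  # -> 8 2
```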

Joblib is a set of tools to provide lightweight pipelining in Python: in particular, transparent disk-caching of functions, lazy re-evaluation (memoize pattern), and easy simple parallel computing. Joblib is optimized to be fast and robust, particularly on large data, and has specific optimizations for numpy arrays. In this project it is used to save and reload the trained model.
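The save/load pattern, sketched with a stand-in dictionary instead of the real trained pipeline:

```python
import os
import tempfile

import joblib

# Persist a Python object to disk and load it back -- the same dump/load
# pattern used to save and reload the trained pipeline in this project.
obj = {'model_name': 'salary_pipeline', 'version': 1}
path = os.path.join(tempfile.gettempdir(), 'demo_model.pkl')

joblib.dump(obj, path)
restored = joblib.load(path)
print(restored == obj)  # -> True
```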

Selenium is a strong set of tools that firmly supports the rapid development of test automation for web applications. It offers a set of testing functions specially designed for the requirements of testing a web application. The structured automation testing life cycle comprises a multi-stage process covering the activities required to introduce an automated test tool, develop and run test cases, develop the test design, and build and manage test data and environments.

Time provides functionality other than representing time, like waiting during code execution and measuring the efficiency of your code.
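Both of those uses – waiting and measuring – fit in a few lines (a generic sketch, not the project's timing code):

```python
import time

# Measure how long a piece of code takes with a high-resolution clock.
start = time.perf_counter()
time.sleep(0.05)                      # stand-in for real work (e.g. a page load wait)
elapsed = time.perf_counter() - start
print(f'elapsed: {elapsed:.3f}s')     # roughly 0.05s
```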

Wordcloud is a visual representation of text data. It displays a list of words, the importance of each being shown with font size or color. This format is useful for quickly perceiving the most prominent terms.

NLTK, the Natural Language Toolkit, is a suite of libraries for Natural Language Processing in Python and provides a practical introduction to programming for language processing.

Streamlit is an open-source Python library that makes it easy to create and share beautiful, custom web apps for machine learning and data science. In just a few minutes you can build and deploy powerful data apps. It is a very easy library to create a perfect dashboard by spending a little amount of time. It also comes with the inbuilt webserver and lets you deploy in the docker container. When you run the app, the localhost server will open in your browser automatically.

Pickle in Python is primarily used in serializing and deserializing a Python object structure. In other words, it's the process of converting a Python object into a byte stream to store it in a file/database, maintain program state across sessions, or transport data over the network.
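That round-trip looks like this (a generic sketch with a made-up record, not the project's saved model):

```python
import pickle

# Serialize ("pickle") an object into bytes, then restore it.
record = {'Job Title': 'Data Scientist', 'python_yn': 1, 'Rating': 4.2}
payload = pickle.dumps(record)          # object -> byte stream
restored = pickle.loads(payload)        # byte stream -> object
print(restored == record)  # -> True
```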

Linear regression is the next step up after correlation. It is used when we want to predict the value of a variable based on the value of another variable; the variable we want to predict is called the dependent variable. The model is based on the equation of a straight line, y = a + bx, where a is the y-intercept (the value of y when x = 0) and b is the slope (the amount y increases as x increases by one unit). Linear regression fits a straight line through a y vs. x scatterplot, which is why it is called linear regression. Simple linear regression is a statistical method that allows us to summarize and study the relationship between two continuous (quantitative) variables: one variable, denoted x, is regarded as the predictor, explanatory, or independent variable. The goal of multiple linear regression (MLR) is to model the linear relationship between the explanatory (independent) variables and the response (dependent) variable. In essence, multiple regression is the extension of ordinary least-squares (OLS) regression to more than one explanatory variable.
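To make y = a + bx concrete, here is a tiny fit on synthetic data where the true line is known to be y = 2 + 3x (illustrative only, not the project's model):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Fit y = a + b*x on synthetic data whose true relationship is y = 2 + 3x.
x = np.arange(10, dtype=float).reshape(-1, 1)   # 10 samples, 1 feature
y = 2 + 3 * x.ravel()

model = LinearRegression().fit(x, y)

# The fitted intercept and slope recover a = 2 and b = 3.
print(round(model.intercept_, 2), round(model.coef_[0], 2))  # -> 2.0 3.0
```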

Installation

Used an Intel Core i5 9th generation with an NVIDIA GeForce GTX 1650.

Windows 10 Environment Used.

Already Installed Anaconda Navigator for Python 3.x

The Code is written in Python 3.8.

If you don't have Python installed you can install Python from its official site.

If you are using a lower version of Python you can upgrade using pip. To make sure you have the latest version of pip, run python -m pip install --upgrade pip and press Enter.

Run-How to Use-Steps

Keep your internet connection on while running or accessing files and throughout too.

Follow this when you want to perform from scratch.

Open Anaconda Prompt, Perform the following steps:

cd

pip install pandas

pip install matplotlib

pip install seaborn

pip install numpy

Note: If it shows an error such as ‘No module named …’, install the relevant module.

You can also create a requirements.txt file with pip freeze > requirements.txt

cd

run .py or .ipynb files.

Paste URL to browser to check whether working locally or not.

Follow this when you want to just perform on local machine.

Download ZIP File.

Right-Click on ZIP file in download section and select Extract file option, which will unzip file.

Move unzip folder to desired folder/location be it D drive or desktop etc.

Open Anaconda Prompt, write cd and press Enter.

eg: cd C:\Users\Monica\Desktop\Projects\Python Projects 1\ 23)End_To_End_Projects\Project_3_ML_Scrape_To_Deployment_EndToEnd_SalaryPrediction\Project_ML_SalaryPrediction

In Anaconda Prompt, run pip install -r requirements.txt to install all packages.

In Anaconda Prompt, write streamlit run salary_prediction.py and press Enter.

Paste URL to browser to check whether working locally or not.

Please be careful with spellings and numbers while typing the filename; it is easier to just copy the filename and then run it, to avoid any silly errors.

Note: cd

[Go to the folder where the file is. Select the path at the top, right-click and choose Copy, paste it after cd with one space, and press Enter; then you can access all files in that folder.] [cd means change directory]

Directory Tree-Structure of Project

To Do-Future Scope

Could make the search more customizable.

Technologies Used-System Requirements-Tech Stack

Download the Material

project

modelbuilding

savedmodel

appfile

requirements

detailedwebsite

Conclusion

Modeling

Could also apply random forest regression to check for any larger improvement.

Analysis

Created a pipeline and then built the model, which makes any future re-use easy.

Credits

Ken Jee Channel

Paper Citation

 Paper Citation here